reward probability
From Classical Data to Quantum Advantage -- Quantum Policy Evaluation on Quantum Hardware
Hein, Daniel, Wiedemann, Simon, Baumann, Markus, Felbinger, Patrik, Klein, Justin, Schieder, Maximilian, Stein, Jonas, Schuman, Daniëlle, Cope, Thomas, Udluft, Steffen
IQM Germany Abstract--Quantum policy evaluation (QPE) is a reinforcement learning (RL) algorithm which is quadratically more efficient than an analogous classical Monte Carlo estimation. It makes use of a direct quantum mechanical realization of a finite Markov decision process, in which the agent and the environment are modeled by unitary operators and exchange states, actions, and rewards in superposition. Previously, the quantum environment has been implemented and parametrized manually for an illustrative benchmark using a quantum simulator . In this paper, we demonstrate how these environment parameters can be learned from a batch of classical observational data through quantum machine learning (QML) on quantum hardware. The learned quantum environment is then applied in QPE to also compute policy evaluations on quantum hardware. Our experiments reveal that, despite challenges such as noise and short coherence times, the integration of QML and QPE shows promising potential for achieving quantum advantage in RL.
An Empirical Evaluation of Thompson Sampling
Thompson sampling is one of oldest heuristic to address the exploration / exploitation trade-off, but it is surprisingly unpopular in the literature. We present here some empirical results using Thompson sampling on simulated and real data, and show that it is highly competitive. And since this heuristic is very easy to implement, we argue that it should be part of the standard baselines to compare against.
Forgetful Bayes and myopic planning: Human learning and decision-making in a bandit setting
How humans achieve long-term goals in an uncertain environment, via repeated trials and noisy observations, is an important problem in cognitive science. We investigate this behavior in the context of a multi-armed bandit task. We compare human behavior to a variety of models that vary in their representational and computational complexity. Our result shows that subjects' choices, on a trial-totrial basis, are best captured by a "forgetful" Bayesian iterative learning model [21] in combination with a partially myopic decision policy known as Knowledge Gradient [7]. This model accounts for subjects' trial-by-trial choice better than a number of other previously proposed models, including optimal Bayesian learning and risk minimization, ε-greedy and win-stay-lose-shift. It has the added benefit of being closest in performance to the optimal Bayesian model than all the other heuristic models that have the same computational complexity (all are significantly less complex than the optimal model). These results constitute an advancement in the theoretical understanding of how humans negotiate the tension between exploration and exploitation in a noisy, imperfectly known environment.
Designing Optimal Behavioral Experiments Using Machine Learning
Valentin, Simon, Kleinegesse, Steven, Bramley, Neil R., Seriès, Peggy, Gutmann, Michael U., Lucas, Christopher G.
Computational models are powerful tools for understanding human cognition and behavior. They let us express our theories clearly and precisely, and offer predictions that can be subtle and often counter-intuitive. However, this same richness and ability to surprise means our scientific intuitions and traditional tools are ill-suited to designing experiments to test and compare these models. To avoid these pitfalls and realize the full potential of computational modeling, we require tools to design experiments that provide clear answers about what models explain human behavior and the auxiliary assumptions those models must make. Bayesian optimal experimental design (BOED) formalizes the search for optimal experimental designs by identifying experiments that are expected to yield informative data. In this work, we provide a tutorial on leveraging recent advances in BOED and machine learning to find optimal experiments for any kind of model that we can simulate data from, and show how by-products of this procedure allow for quick and straightforward evaluation of models and their parameters against real experimental data. As a case study, we consider theories of how people balance exploration and exploitation in multi-armed bandit decision-making tasks. We validate the presented approach using simulations and a real-world experiment. As compared to experimental designs commonly used in the literature, we show that our optimal designs more efficiently determine which of a set of models best account for individual human behavior, and more efficiently characterize behavior given a preferred model. At the same time, formalizing a scientific question such that it can be adequately addressed with BOED can be challenging and we discuss several potential caveats and pitfalls that practitioners should be aware of. We provide code and tutorial notebooks to replicate all analyses.
Theory of Acceleration of Decision Making by Correlated Time Sequences
Okada, Norihiro, Yamagami, Tomoki, Chauvet, Nicolas, Ito, Yusuke, Hasegawa, Mikio, Naruse, Makoto
Photonic accelerators have been intensively studied to provide enhanced information processing capability to benefit from the unique attributes of physical processes. Recently, it has been reported that chaotically oscillating ultrafast time series from a laser, called laser chaos, provide the ability to solve multi-armed bandit (MAB) problems or decision-making problems at GHz order. Furthermore, it has been confirmed that the negatively correlated time-domain structure of laser chaos contributes to the acceleration of decision-making. However, the underlying mechanism of why decision-making is accelerated by correlated time series is unknown. In this study, we demonstrate a theoretical model to account for accelerating decision-making by correlated time sequence. We first confirm the effectiveness of the negative autocorrelation inherent in time series for solving two-armed bandit problems using Fourier transform surrogate methods. We propose a theoretical model that concerns the correlated time series subjected to the decision-making system and the internal status of the system therein in a unified manner, inspired by correlated random walks. We demonstrate that the performance derived analytically by the theory agrees well with the numerical simulations, which confirms the validity of the proposed model and leads to optimal system design. The present study paves the way for improving the effectiveness of correlated time series for decision-making, impacting artificial intelligence and other applications.
Contextual Bandits Evolving Over Finite Time
Deshpande, Harsh, Jain, Vishal, Moharir, Sharayu
--Contextual bandits have the same exploration-exploitation tradeoff as standard multi-armed bandits. On adding positive externalities that decay with time, this problem becomes much more difficult as wrong decisions at the start are hard to recover from. We explore existing policies in this setting and highlight their biases towards the inherent reward matrix. We propose a rejection based policy that achieves a low regret irrespective of the structure of the reward probability matrix. In the context of restaurant recommendation systems, users can generally be classified into multiple user types with different preferences for different restaurants.
Guaranteed satisficing and finite regret: Analysis of a cognitive satisficing value function
Tamatsukuri, Akihiro, Takahashi, Tatsuji
As reinforcement learning algorithms are being applied to increasingly complicated and realistic tasks, it is becoming increasingly difficult to solve such problems within a practical time frame. Hence, we focus on a \textit{satisficing} strategy that looks for an action whose value is above the aspiration level (analogous to the break-even point), rather than the optimal action. In this paper, we introduce a simple mathematical model called risk-sensitive satisficing ($RS$) that implements a satisficing strategy by integrating risk-averse and risk-prone attitudes under the greedy policy. We apply the proposed model to the $K$-armed bandit problems, which constitute the most basic class of reinforcement learning tasks, and prove two propositions. The first is that $RS$ is guaranteed to find an action whose value is above the aspiration level. The second is that the regret (expected loss) of $RS$ is upper bounded by a finite value, given that the aspiration level is set to an "optimal level" so that satisficing implies optimizing. We confirm the results through numerical simulations and compare the performance of $RS$ with that of other representative algorithms for the $K$-armed bandit problems.
Been There, Done That: Meta-Learning with Episodic Recall
Ritter, Samuel, Wang, Jane X., Kurth-Nelson, Zeb, Jayakumar, Siddhant M., Blundell, Charles, Pascanu, Razvan, Botvinick, Matthew
Meta-learning agents excel at rapidly learning new tasks from open-ended task distributions; yet, they forget what they learn about each task as soon as the next begins. When tasks reoccur - as they do in natural environments - metalearning agents must explore again instead of immediately exploiting previously discovered solutions. We propose a formalism for generating open-ended yet repetitious environments, then develop a meta-learning architecture for solving these environments. This architecture melds the standard LSTM working memory with a differentiable neural episodic memory. We explore the capabilities of agents with this episodic LSTM in five meta-learning environments with reoccurring tasks, ranging from bandits to navigation and stochastic sequential decision problems.
Category Theoretic Analysis of Photon-based Decision Making
Naruse, Makoto, Kim, Song-Ju, Aono, Masashi, Berthel, Martin, Drezet, Aurélien, Huant, Serge, Hori, Hirokazu
Decision making is a vital function in this age of machine learning and artificial intelligence, yet its physical realization and theoretical fundamentals are still not completely understood. In our former study, we demonstrated that single-photons can be used to make decisions in uncertain, dynamically changing environments. The two-armed bandit problem was successfully solved using the dual probabilistic and particle attributes of single photons. In this study, we present a category theoretic modeling and analysis of single-photon-based decision making, including a quantitative analysis that is in agreement with the experimental results. A category theoretic model reveals the complex interdependencies of subject matter entities in a simplified manner, even in dynamically changing environments. In particular, the octahedral and braid structures in triangulated categories provide a better understanding and quantitative metrics of the underlying mechanisms of a single-photon decision maker. This study provides both insight and a foundation for analyzing more complex and uncertain problems, to further machine learning and artificial intelligence.
The Importance of Constraint Smoothness for Parameter Estimation in Computational Cognitive Modeling
Nunes, Abraham, Rudiuk, Alexander
Psychiatric neuroscience is increasingly aware of the need to define psychopathology in terms of abnormal neural computation. The central tool in this endeavour is the fitting of computational models to behavioural data. The most prominent example of this procedure is fitting reinforcement learning (RL) models to decision-making data collected from mentally ill and healthy subject populations. These models are generative models of the decision-making data themselves, and the parameters we seek to infer can be psychologically and neurobiologically meaningful. Currently, the gold standard approach to this inference procedure involves Monte-Carlo sampling, which is robust but computationally intensive---rendering additional procedures, such as cross-validation, impractical. Searching for point estimates of model parameters using optimization procedures remains a popular and interesting option. On a novel testbed simulating parameter estimation from a common RL task, we investigated the effects of smooth vs. boundary constraints on parameter estimation using interior point and deterministic direct search algorithms for optimization. Ultimately, we show that the use of boundary constraints can lead to substantial truncation effects. Our results discourage the use of boundary constraints for these applications.